05. Text Processing, Bag of Words
Text Processing
I mentioned that one of our tasks will be to convert any user input text into data that our deployed model can accept as input. You've seen a few examples of text pre-processing, and the steps usually go something like this:
- Get rid of any special characters like punctuation
- Convert all text to lowercase and split into individual words
- Create a vocabulary that assigns each unique word a numerical value or converts words into a vector of numbers
This last step is often called vectorization (the term tokenization is sometimes used too, though it more precisely means splitting text into individual words).
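The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the exact code used later in the lesson: the function names `preprocess` and `build_vocabulary` and the sample sentences are made up for this example, and assigning each word an integer in order of first appearance is just one simple convention among several.

```python
import re

def preprocess(text):
    """Replace anything that is not a letter or digit with a space,
    lowercase the result, and split it into individual words."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    return text.lower().split()

def build_vocabulary(documents):
    """Assign each unique word an integer, in order of first appearance
    (an assumed convention for this sketch)."""
    vocab = {}
    for doc in documents:
        for word in preprocess(doc):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

# Hypothetical sample input
docs = ["The movie was great!", "The plot was thin, but the cast was great."]
vocab = build_vocabulary(docs)

print(preprocess(docs[0]))  # ['the', 'movie', 'was', 'great']
print(vocab["the"], vocab["great"])  # 0 3
```

Note that replacing punctuation with a space, rather than deleting it, keeps words separated by a comma or dash from being glued together.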
In the next example, you'll see exactly how I do these processing steps; I'll also be vectorizing words using a method called bag of words. If you'd like to learn more about bag of words, please check out the video below, recorded by another of our instructors, Arpan!
Bag of Words
You can read more about the bag of words model, and its applications, on this page. It's a useful way to represent a piece of text by how often each word occurs in it, ignoring word order.
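To make the idea concrete, here is a small sketch that builds bag-of-words vectors by hand using only the standard library. The helper names `tokenize` and `bag_of_words` and the two sample sentences are assumptions made for illustration, and sorting the vocabulary alphabetically is an arbitrary choice; in practice a library vectorizer is often used for this step instead.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text, drop non-alphanumeric characters, split into words."""
    return re.sub(r"[^a-z0-9]", " ", text.lower()).split()

def bag_of_words(documents):
    """Return a shared vocabulary and one word-count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Every unique word across all documents, in alphabetical order
    vocabulary = sorted({word for doc in tokenized for word in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # A word absent from this document gets a count of 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical sample input
vocabulary, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors[0])  # [1, 0, 1, 0, 1]
print(vectors[1])  # [1, 1, 0, 1, 2]
```

Each vector has one entry per vocabulary word, so every document ends up the same length regardless of how many words it contains, and the order of words in the original text is discarded.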
Quiz
For the following quiz questions, consider a "document" that is just the following sentence:
At a very basic level, I think people need the opportunity to learn and to grow.